📚 node [[k means|k means]]
⥅ related node [[image compression and quantisation with k means clustering]]
⥅ related node [[k means clustering]]
⥅ related node [[k means]]
⥅ node [[k-means]] pulled by Agora

k-means

Go back to the [[AI Glossary]]

#clustering

A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:

  • Iteratively determines the best k center points (known as centroids).
  • Assigns each example to the closest centroid.
  • Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.
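Not part of the glossary entry, but as a rough sketch of those steps, here is the assign-then-update loop in NumPy; the toy numbers and the naive fixed initialisation are assumptions made for illustration:

import numpy as np

# toy data: each row is one example with two made-up features
examples = np.array([[1.0, 2.0], [1.5, 1.8], [0.5, 0.6], [8.0, 8.0], [9.0, 11.0]])
k = 2
# naive initialisation for the sketch; real k-means chooses starting centroids more carefully
centroids = examples[[0, 3]].copy()

for _ in range(10):
    # squared Euclidean distance from every example to every centroid
    distances = ((examples[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # assign each example to its closest centroid
    labels = distances.argmin(axis=1)
    # move each centroid to the mean of the examples assigned to it
    centroids = np.array([examples[labels == i].mean(axis=0) for i in range(k)])

# the quantity k-means tries to minimise: total squared distance from each example to its centroid
print(((examples - centroids[labels]) ** 2).sum())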

For example, consider the following plot of dog height to dog width:

A scatter plot

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:

A k-means cluster plot with centroids

Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. Each of the three centroids identifies the mean height and mean width of the dogs in its cluster. So, the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example in the cluster.

The preceding illustrations show k-means for examples with only two features (height and width). Note that k-means can group examples across many features.
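One possible way to act on that sweater-sizing idea, sketched with scikit-learn's KMeans; the library choice and the toy height/width measurements are assumptions, not part of the glossary entry:

from sklearn.cluster import KMeans
import numpy as np

# hypothetical (height, width) measurements for a handful of dogs
dogs = np.array([[30, 20], [32, 22], [45, 30], [47, 33], [60, 41], [63, 44]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dogs)

# each row of cluster_centers_ is the mean height and width of one cluster,
# i.e. a candidate small / medium / large sweater size
print(km.cluster_centers_)
print(km.labels_)  # which of the three size groups each dog falls into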

⥅ node [[k-means-clustering]] pulled by Agora

k-means clustering

Go to [[Week 2 - Introduction]] or back to the [[Main AI Page]]

Groups examples from a data set into clusters.

  • First, find a way to turn each example into a vector
  • Second, use Euclidean distance to group the examples based on how far those vectors are from central points (see the sketch after this list)
    • like lines of best fit, but points of best fit when you scatter-plot the examples
    • note that sometimes there will be overlap between clusters (roses video example)
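A minimal sketch of those two steps; the example measurements and centre points are made up for illustration:

import math

# step 1: turn each example into a vector of numeric features, e.g. (height, width)
examples = [(30, 20), (45, 30), (60, 41)]

# step 2: measure Euclidean distance from each example to some candidate centre points
centres = [(32, 21), (61, 42)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for example in examples:
    nearest = min(centres, key=lambda c: euclidean(example, c))
    print(example, "is closest to centre", nearest)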

A graphical representation of k-means clustering

Pseudo-code time!

#ToDo Read up on k-means and k-median clustering and research it all in more depth before trying again with the pseudocode. The procedure below is dramatically over-simplified.

The k-means algorithm is extremely simple to understand and implement. You begin by randomly assigning each example from the data set to a cluster and calculating the centroid of each cluster as the mean of its member examples. You then iterate over the data set to determine whether each example is closer to its own cluster's centroid or to another cluster's centroid. If an example is closer to another cluster, it is moved to that cluster and the centroids are recalculated. This process continues until no example changes cluster.

The k-value is the number of clusters

#ToDo Research how to code k-means clustering properly.

For language reference, check out the [[Python - Main Page]]

import math
import random

# have a list of (x, y) points to cluster
dataset = [[4, 25], [20, 22], [19, 17], [8, 8], [28, 2], [9, 17], [3, 18], [8, 1], [5, 13], [21, 19], [30, 12], [18, 29], [4, 26], [15, 5], [14, 26], [18, 3], [22, 27], [15, 20], [9, 19], [2, 30], [28, 14], [15, 19], [2, 9], [26, 8], [23, 19], [28, 4], [28, 28], [3, 10], [11, 12]]

# have a list of 3 clusters to which points can be assigned, each a dictionary with a label, a centroid vector and a member list
vectormap = [
	{"label": "red", "vector": [8, 6], "members": []},
	{"label": "green", "vector": [20, 1], "members": []},
	{"label": "blue", "vector": [5, 5], "members": []},
]

# Euclidean distance between two (x, y) points
def distance(a, b):
	return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# have a function that randomly assigns each point in the dataset to one of the clusters
def cluster_func(dataset, vectormap):
	for vector in dataset:
		clusterhome = random.randint(0, len(vectormap) - 1)
		vectormap[clusterhome]["members"].append(vector)
	return vectormap

# have a function that recalculates each cluster's centroid as the mean of its member points
def update_centroids(vectormap):
	for cluster in vectormap:
		if cluster["members"]:
			xs = [m[0] for m in cluster["members"]]
			ys = [m[1] for m in cluster["members"]]
			cluster["vector"] = [sum(xs) / len(xs), sum(ys) / len(ys)]
	return vectormap

# have a function that finds the index of the cluster whose centroid is closest to a given point
def sub_vector(vector, vectormap):
	return min(range(len(vectormap)), key=lambda i: distance(vector, vectormap[i]["vector"]))

# have a function that checks whether any point is closer to another cluster's centroid than to its own
def cluster_checker(vectormap):
	rehome_needed = False
	for index, cluster in enumerate(vectormap):
		for member in cluster["members"]:
			if sub_vector(member, vectormap) != index:
				rehome_needed = True
	return rehome_needed

# have a function that moves every point to its closest cluster, then recalculates the centroids
def cluster_rehome(vectormap):
	assignments = [[] for _ in vectormap]
	for cluster in vectormap:
		for member in cluster["members"]:
			assignments[sub_vector(member, vectormap)].append(member)
	for cluster, members in zip(vectormap, assignments):
		cluster["members"] = members
	return update_centroids(vectormap)

# have a main function that handles the main loop
def main(dataset, vectormap):
	cluster_func(dataset, vectormap)
	update_centroids(vectormap)
	while cluster_checker(vectormap):
		cluster_rehome(vectormap)
	# print vectormap with clusters and members (stand-in for plotting)
	for cluster in vectormap:
		print(cluster["label"], cluster["vector"], cluster["members"])

main(dataset, vectormap)
	